Abstract

We give tight lower and upper bounds on the expected missing 
mass for distributions over finite and countably infinite spaces. An 
essential characterization of the extremal distributions is given. We 
also provide an extension to totally bounded metric spaces that may 
be of independent interest. 

1 Introduction 
1.1 Background 

Let S be a countable set endowed with a probability measure P. Suppose
that $X_1, \ldots, X_t$ are drawn independently from S according to P. Define the
missing mass $U_t$ as the following random variable:

$$U_t = P(S \setminus \{X_1, \ldots, X_t\}). \tag{1}$$

In words, $U_t$ is the total probability mass of the elements of S that were not
observed in the t samples.
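As a concrete illustration of definition (1), the following minimal sketch computes the missing mass of a given sample when S is finite. The function name `missing_mass` and the three-point distribution are assumptions made for the example only.

```python
from collections.abc import Hashable, Iterable, Mapping


def missing_mass(p: Mapping[Hashable, float], sample: Iterable[Hashable]) -> float:
    """Total probability of the elements of S that do not occur in the sample."""
    seen = set(sample)
    return sum(mass for s, mass in p.items() if s not in seen)


# Hypothetical three-point distribution; the sample below never sees "c".
p = {"a": 0.5, "b": 0.3, "c": 0.2}
u = missing_mass(p, ["a", "b", "a"])
```

Here `u` is the mass of the single unobserved element "c"; with an empty sample the missing mass is the whole unit mass.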

The missing mass is a quantity of interest in almost any application 
involving sampling from a large discrete set, whether it be fish in a pond 
or words in a language corpus. (Famously, Alan Turing developed what
became known as Good-Turing frequency estimators (Good, 1953) as part
of his work on cracking the Enigma cypher during WWII; see the account
(Good, 2000).) We note right away that

$$E[U_t] = \sum_{s \in S} p_s (1 - p_s)^t,$$

where $p_s = P(\{s\})$, and that $E[U_t] \to 0$ as $t \to \infty$ (the latter follows from
Lebesgue's Dominated Convergence Theorem). Observe also that $U_t \to 0$
almost surely as $t \to \infty$; one way of seeing this is to apply the deviation
inequality

$$P(|U_t - E[U_t]| > \varepsilon) \le 2e^{-t\varepsilon^2} \tag{2}$$

of McAllester and Ortiz (2003, Theorem 16) together with the Borel-Cantelli
Lemma.
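The identity $E[U_t] = \sum_s p_s(1-p_s)^t$ is easy to evaluate when S is finite. The sketch below (the helper name `expected_missing_mass` and the uniform four-point distribution are illustrative assumptions) shows the decay to 0 as t grows.

```python
def expected_missing_mass(p, t):
    """E[U_t] = sum_s p_s * (1 - p_s)**t for a finite list of probabilities p."""
    return sum(ps * (1.0 - ps) ** t for ps in p)


# Hypothetical example: the uniform distribution on four points.
uniform4 = [0.25] * 4
values = [expected_missing_mass(uniform4, t) for t in (0, 1, 5, 50)]
```

For the uniform distribution on n points the sum collapses to $(1 - 1/n)^t$, so the values decrease geometrically in t.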

The topic of interest of this paper is the rate of decay of $E[U_t]$. For
example, when S is finite, we have the trivial estimate

$$E[U_t] \le (1 - p_{\min})^t,$$

where $p_{\min} = \min_{s \in S} p_s$ and we assume without loss of generality that P
has full support. Of course, this bound is not distribution-free, since it
depends on $p_{\min}$. Is a distribution-free estimate possible, at least for finite
S? What about countable S? How about lower bounds on the decay rate
of the missing mass? These and related questions are investigated in this
paper.

1.2 Related work 

The missing mass problem is an unavoidable feature of density estimation, 
where non-zero density must be assigned to unobserved regions. The
process of transferring some of the probability mass from observed points to
unobserved ones is known as smoothing, and Laplace's "add one" estimator
(Laplace, 1814) appears to be the earliest smoother. Smoothing is now an
indispensable component of density estimation; dozens of methods have been
proposed for discrete densities alone (cf. Krichevsky and Trofimov (1981);
Orlitsky et al. (2003)).



Since a review of smoothing methods is beyond the scope of this paper, 
we briefly mention the celebrated Good-Turing estimator for the missing
mass. Given the sample $X = \{X_1, \ldots, X_t\}$, define $Y^{(1)} \subset X$ to consist of
those $X_i$ that occur exactly once. The Good-Turing missing mass estimator
is given by

$$\hat{U}_t = \frac{|Y^{(1)}|}{t};$$

this is the proportion of frequency-one elements. An attractive feature of
this estimator is its diminishing bias:

$$E[\hat{U}_t] - E[U_t] = \frac{1}{t} E[U_t^{(1)}], \tag{3}$$

where $U_t^{(1)} = P(Y^{(1)})$ is the random variable corresponding to the total
mass of the frequency-one items; this variant of Good's theorem is proved in
McAllester and Schapire (2000, Theorem 1). Additionally, both the missing
mass $U_t$ and its estimate $\hat{U}_t$ are tightly concentrated about their expectations
(inequality (2) establishes this for $U_t$; see McAllester and Schapire (2000) for
other deviation estimates).
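The Good-Turing missing mass estimator described above needs only the sample itself. A minimal sketch follows; the function name `good_turing_missing_mass` and the toy sample are assumptions for illustration.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Fraction of the sample made up of items that occur exactly once."""
    sample = list(sample)
    if not sample:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Hypothetical sample of size 6 in which "b", "c" and "d" each occur once.
estimate = good_turing_missing_mass(["a", "b", "a", "c", "d", "a"])
```

Note that the estimator never looks at the underlying distribution P, which is what makes it usable in practice.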



2 Main results 

Although (2) and (3) provide a computationally efficient estimator of $E[U_t]$,
they do not yield any a priori information about the magnitude of the latter.

To state our results, it will be convenient to define the plateau length $\ell$ of a
probability distribution P over $\mathbb{N}$:

$$\ell(P) = \sup_{0 < a < 1} |\{i \in \mathbb{N} : a/2 < p_i \le a\}|, \tag{4}$$

where $p_i = P(\{i\})$. (Note that $\ell(P) = \infty$ is possible.)
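For a finitely supported distribution the supremum in the definition of the plateau length can be computed directly. The sketch below assumes the window in (4) is half-open, $a/2 < p_i \le a$, and uses the observation that lowering a to the largest $p_i$ in the window loses no elements, so only $a = p_i$ needs to be tried. The name `plateau_length` is hypothetical.

```python
def plateau_length(p):
    """Plateau length of a finitely supported distribution, following (4).

    For any a, shrinking a down to the largest p_i in (a/2, a] keeps every
    element of the window, so it suffices to try a = p_i for each p_i < 1.
    """
    support = sorted(x for x in p if x > 0)
    best = 0
    for a in support:
        if a >= 1:  # the supremum in (4) ranges over 0 < a < 1 only
            continue
        count = sum(1 for x in support if a / 2 < x <= a)
        best = max(best, count)
    return best


uniform = plateau_length([0.25] * 4)
dyadic = plateau_length([0.5, 0.25, 0.125, 0.125])
```

The uniform distribution on n points has plateau length n, while a geometrically decaying distribution keeps the plateau length small.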

Our first two results deal with upper and lower bounds on $E[U_t]$. We use
the notation $[n] = \{1, 2, \ldots, n\}$ throughout the paper.

Theorem 1. The expected missing mass is bounded above as follows:

(i) For $n \in \mathbb{N}$ and $S = [n]$,

$$E[U_t] \le \begin{cases} e^{-t/n}, & t \le n, \\ \dfrac{n}{et}, & t > n. \end{cases}$$

(ii) For $S = \mathbb{N}$,

$$E[U_t] \le \frac{c\,\ell(P)}{t},$$

where c is a universal constant.

Remark 2. It is possible to somewhat (but not by much, see Proposition
3) improve the bound in (i) in some regimes of n and t; this will become
apparent from our proofs. The bound $n/(et)$ holds everywhere, but is vacuous
when $et < n$. A slightly better bound of $n(1 - 1/n)^n/t$ was obtained by
R. Boppana in a very elegant way, in response to our question (Boppana,
2011). We took an entirely different route, which also basically characterizes
the extremal distribution.
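The bound in part (i) can be checked numerically on specific distributions. The sketch below is an illustration only, not part of the proofs: it evaluates $E[U_t] = \sum_i p_i(1-p_i)^t$ for the uniform distribution on [n] and for a distribution with n - 1 atoms of mass 1/(t+1) and one heavy atom, and compares both with $e^{-t/n}$ (for $t \le n$) and $n/(et)$ (for $t > n$). The helper names and the choice n = 10 are assumptions.

```python
import math


def expected_missing_mass(p, t):
    """E[U_t] = sum_i p_i * (1 - p_i)**t for a finite distribution p."""
    return sum(x * (1.0 - x) ** t for x in p)


def theorem1_bound(n, t):
    """Bound of Theorem 1(i): exp(-t/n) for t <= n and n/(e*t) for t > n."""
    return math.exp(-t / n) if t <= n else n / (math.e * t)


def one_heavy_atom(n, t):
    """Hypothetical helper: n - 1 atoms of mass 1/(t+1) plus one heavy atom."""
    x = 1.0 / (t + 1)
    return [x] * (n - 1) + [1.0 - (n - 1) * x]


n = 10
rows = []
for t in (1, 5, 10, 20, 100):
    uniform = expected_missing_mass([1.0 / n] * n, t)
    rows.append((t, uniform, theorem1_bound(n, t)))

# For large t the one-heavy-atom distribution comes much closer to n/(e*t).
near_extremal = expected_missing_mass(one_heavy_atom(n, 100), 100)
```

For t much larger than n the uniform distribution is far below the bound, whereas the one-heavy-atom distribution stays within a constant factor of it.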
Proposition 3. The estimates in Theorem 1 are essentially tight:

(i) For each $n \in \mathbb{N}$ and $t \ge n$, there is a distribution on [n] such that

$$E[U_t] \ge \frac{n-1}{ct},$$

where c is an absolute constant.

(ii) For each integer $a \ge 1$, there is a distribution P over $S = \mathbb{N}$ such that
$\ell(P) = a$ and

$$E[U_t] \ge \frac{a}{ct}, \qquad t \ge a,$$

where c is an absolute constant.

Furthermore, if we allow distributions with infinite plateau length, then
no nontrivial uniform (or even pointwise) bound on $E[U_t]$ is possible:

Proposition 4. For any sequence $1 > r_1 > r_2 > \ldots$ decreasing to 0, there
is a distribution on $S = \mathbb{N}$ such that $E[U_t] \ge r_t$ for all $t \ge 1$.

Next, we turn our attention to extremizing distributions for finite S. 
These turn out to exhibit a fairly regular behavior, with essentially a single 
phase transition. Since $E[U_t]$ is a symmetric function of the $\{p_i\}$, we
henceforth assume that $p_1 \le p_2 \le \cdots \le p_n$. In the sequel, the vector $(p_1, \ldots, p_n)$
will be denoted by p.

Theorem 5. Let $|S| = n < \infty$. Then

(i) Every local maximum $p^*$ of $E[U_t]$ is of the form

$$p^*_1 = p^*_2 = p^*_3 = \cdots = p^*_{n-1} \le p^*_n$$

(that is, $p^*$ consists of one "heavy" element and $n - 1$ identical "light"
ones, where the possibility of heavy = light is not excluded).

(ii) There exists a threshold $\tau = \tau(n) > n$ such that:

(a) For $t < \tau$, there is a unique global maximum

$$p^*_1 = p^*_2 = p^*_3 = \cdots = p^*_{n-1} = p^*_n = \frac{1}{n}.$$

(b) For $t > \tau$, there is a unique global maximum and it has the form

$$p^*_1 = p^*_2 = p^*_3 = \cdots = p^*_{n-1} < p^*_n.$$

(iii) As $n \to \infty$,

$$\tau = n + \sqrt{2n}\,(1 + o(1)).$$

(iv) For t > n + 

Remark 6. We have not excluded the possibility that for $t = \tau$, both of the
distributions described in (ii) attain the maximum. (Of course, this seems
highly improbable.)

3 Proofs 

Lemma 7. Consider the function $f(x) = x(1 - x)^t$ on the interval [0, 1] for
an arbitrary fixed $t > 0$.

(i) The function f increases on $(0, 1/(t+1))$, decreases on $(1/(t+1), 1)$, and
achieves its maximum at $x = 1/(t+1)$, where it is bounded above by $1/(et)$.

(ii) The derivative $f'$ decreases on $(0, 2/(t+1))$ and increases on $(2/(t+1), 1)$.

Proof. (i) A simple calculation shows that $f' > 0$ on $(0, 1/(t+1))$, $f' < 0$
on $(1/(t+1), 1)$, and $f' = 0$ at $x^* = 1/(t+1)$. Substituting, we obtain

$$f(x^*) = \frac{1}{t+1}\left(1 - \frac{1}{t+1}\right)^t = \frac{1}{t}\left(1 - \frac{1}{t+1}\right)^{t+1} < \frac{1}{et}.$$

(ii) Routine.

□
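Part (i) of Lemma 7 can be sanity-checked numerically. The following sketch (a grid search chosen here purely for illustration; `check_lemma7` is a hypothetical name) compares the largest value of f on a grid with $f(1/(t+1))$ and with the bound $1/(et)$.

```python
import math


def f(x, t):
    """The function f(x) = x * (1 - x)**t from Lemma 7."""
    return x * (1.0 - x) ** t


def check_lemma7(t, grid=10_000):
    """Return (grid maximum of f on [0, 1], f(1/(t+1)), 1/(e*t))."""
    x_star = 1.0 / (t + 1)
    grid_max = max(f(k / grid, t) for k in range(grid + 1))
    return grid_max, f(x_star, t), 1.0 / (math.e * t)


results = {t: check_lemma7(t) for t in (1, 2, 10, 100)}
```

In each case the grid maximum does not exceed $f(1/(t+1))$, which in turn lies strictly below $1/(et)$.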

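Proof of Theorem 1. (i) It follows from (9) below that, for $t \le n$, the expected
missing mass is maximized when $p_1 = p_2 = \cdots = p_n = 1/n$, yielding
the bound

$$E[U_t] = (1 - 1/n)^t \le e^{-t/n}.$$

For general t, an application of Lemma 7 yields

$$E[U_t] = \sum_{i=1}^{n} p_i (1 - p_i)^t \le \frac{n}{et}.$$

(ii) Let P be a distribution on $S = \mathbb{N}$. Then

$$E[U_t] = \sum_{i=1}^{\infty} p_i(1-p_i)^t = \sum_{i:\, p_i > 1/(t+1)} p_i(1-p_i)^t + \sum_{i:\, p_i \le 1/(t+1)} p_i(1-p_i)^t = E_1 + E_2.$$

By Lemma 7,

$$\begin{aligned}
E_1 &= \sum_{j=0}^{\lfloor \log_2(t+1) \rfloor} \ \sum_{i:\, 2^j/(t+1) < p_i \le 2^{j+1}/(t+1)} p_i(1-p_i)^t \\
&\le \ell(P) \sum_{j=0}^{\lfloor \log_2(t+1) \rfloor} \frac{2^j}{t+1}\left(1 - \frac{2^j}{t+1}\right)^t \\
&\le \frac{\ell(P)}{t+1} \sum_{j=0}^{\lfloor \log_2(t+1) \rfloor} 2^j \exp(-2^j t/(t+1)) \\
&\le c''\,\frac{\ell(P)}{t+1}
\end{aligned} \tag{6}$$

for an appropriate absolute constant $c''$ (since $t/(t+1) \ge 1/2$, the sum over j is dominated by the convergent series $\sum_j 2^j e^{-2^{j-1}}$).
An analogous argument shows that

$$E_2 \le c'''\,\frac{\ell(P)}{t+1}. \tag{7}$$

Combining (6) and (7), we obtain the claim.

□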
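Proof of Proposition 3. (i) Define the distribution p by

$$x = p_1 = p_2 = \cdots = p_{n-1} \le p_n = 1 - (n-1)x,$$

where $x = 1/(t+1)$. Then

$$E[U_t] \ge \frac{n-1}{t+1}\left(1 - \frac{1}{t+1}\right)^t = \frac{n-1}{t}\left(1 - \frac{1}{t+1}\right)^{t+1} \ge \frac{8(n-1)}{27t}.$$

(ii) For any $a \in \mathbb{N}$, define p as follows:

$$p_1 = p_2 = \cdots = p_a = \frac{1}{2a}, \quad p_{a+1} = \cdots = p_{2a} = \frac{1}{4a}, \quad \ldots, \quad p_{ka+1} = \cdots = p_{(k+1)a} = \frac{1}{2^{k+1}a}, \quad \ldots$$

Then, denoting $k = \lceil \log_2(t/a) \rceil$, we have for $t \ge a$: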

* ~ 2^2^i ~2^i ^2^1 ~2^l ~ 2 " 2'^-! V ^ 



k=l 
> ^. 

- 27t 

□ 

Proof of Proposition 4. Let $1 > r_1 > r_2 > \ldots$ be a sequence decreasing to 0.
Observe that



Pi 

/t2 



Select $\tau > 10$ such that $r_\tau < 0.9$. Then we can choose $(p_i)$ so that



Indeed, $(1 - 1/t^2)^t > 0.9$ for $t \ge \tau > 10$. Thus, for $t = \tau$, (8) is satisfied
by any $(p_i)$ with $p_i \le 1/\tau^2$ for all $i \in \mathbb{N}$. For each $t \ge \tau$, choose a finite
sequence $(p_{it})$ such that $p_{it} \le 1/(t+1)^2$ for each i and

$$\sum_i p_{it} = r_t - r_{t+1}.$$



Let p be the distribution obtained by concatenating all the sequences $(p_{it})_{t \ge \tau}$
and the number $1 - r_\tau$. To prove the claim for $t < \tau$, let us define the
following "doubling operator" on distributions:

$$D((p_1, p_2, \ldots)) = (p_1/2, p_1/2, p_2/2, p_2/2, \ldots).$$

It is straightforward to verify that for all distributions p and all $t \in \mathbb{N}$,

$$E_{D^k p}[U_t] \to 1 \quad \text{as } k \to \infty$$

(where the subscript of E specifies the distribution under which the
expectation is taken). Thus, if p is a distribution that satisfies (8) for all $t \ge \tau$,
there is some finite k such that $D^k p$ makes the proposition hold. □
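The effect of the doubling operator is easy to observe on a small example. The sketch below applies D repeatedly to a hypothetical three-point distribution and records the expected missing mass for a fixed t; the names `double` and `expected_missing_mass` and the starting distribution are assumptions for illustration.

```python
def double(p):
    """Doubling operator D: split every atom p_i into two atoms of mass p_i / 2."""
    return [half for x in p for half in (x / 2, x / 2)]


def expected_missing_mass(p, t):
    """E[U_t] = sum_i p_i * (1 - p_i)**t for a finite distribution p."""
    return sum(x * (1.0 - x) ** t for x in p)


p = [0.5, 0.3, 0.2]   # hypothetical starting distribution
t = 5
values = []
for _ in range(12):   # k = 0, 1, ..., 11 applications of D
    values.append(expected_missing_mass(p, t))
    p = double(p)
```

After k doublings the expectation equals $\sum_i p_i(1 - p_i/2^k)^t$, which increases with k and approaches 1, in line with the limit stated above.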

Proof of Theorem 5. (i) For $n, t \in \mathbb{N}$, define $F : [0, 1]^n \to \mathbb{R}$ by

$$F(x) = \sum_{i=1}^{n} x_i(1 - x_i)^t = \sum_{i=1}^{n} f(x_i)$$

(where $f(x) = x(1 - x)^t$). An elementary application of Lagrange
multipliers shows that, under the constraint $\sum_{i=1}^{n} x_i = 1$, a necessary
condition for an extremum is

$$\frac{\partial F}{\partial x_i} = \frac{\partial F}{\partial x_j}, \qquad i, j \in [n].$$

Lemma 7 leaves two possibilities for an extreme point $p^*$: either all
the $p^*_i$ take the value 1/n (we call such distributions univalent) or
the $p^*_i$ take two values $\pi_- < \pi_+$, with $f'(\pi_-) = f'(\pi_+) < 0$ (we call such
distributions bivalent).

In the bivalent case, we have, without loss of generality,

$$p^*_1 = p^*_2 = \cdots = p^*_k = \pi_- < \pi_+ = p^*_{k+1} = p^*_{k+2} = \cdots = p^*_n$$

for some $1 \le k < n$. Define the Lagrangian

$$L(x, \lambda) = F(x) + \lambda(g(x) - 1),$$

where $g(x) = \sum_{i=1}^{n} x_i$, and its associated $(n + 1) \times (n + 1)$ matrix
$H = H(x, \lambda)$, where

$$H_{ij} = \begin{cases}
\dfrac{\partial^2 L}{\partial x_i \partial x_j} = t(x_i(t+1) - 2)(1 - x_i)^{t-2}, & i = j \le n, \\
\dfrac{\partial^2 L}{\partial x_i \partial x_j} = 0, & i \ne j \le n, \\
\dfrac{\partial g}{\partial x_i} = 1, & i \le n,\ j = n + 1, \\
\dfrac{\partial g}{\partial x_j} = 1, & j \le n,\ i = n + 1, \\
0, & i = j = n + 1.
\end{cases}$$

Suppose $k \le n - 2$ and consider the 3×3 lower right submatrix

$$B = B(x) = \begin{pmatrix}
t(x_{n-1}(t+1) - 2)(1 - x_{n-1})^{t-2} & 0 & 1 \\
0 & t(x_n(t+1) - 2)(1 - x_n)^{t-2} & 1 \\
1 & 1 & 0
\end{pmatrix};$$

note that our assumption on k forces $B_{11} = B_{22}$. The second-order
necessary condition for $p^*$ to be a local maximum is that a sequence
of bordered Hessians, including $\det(B(p^*))$, be nonnegative. Since
$f'(\pi_-) = f'(\pi_+) < 0$ and $f'$ decreases on $(0, 2/(t+1))$ and increases on
$(2/(t+1), 1)$, it follows that $\pi_- < 2/(t+1)$ and $\pi_+ > 2/(t+1)$. This
implies that $B_{11} = B_{22} > 0$. Denoting this common value by b, we
have

$$\det(B) = -2b < 0.$$

The necessary condition is thus violated, leaving two possibilities: the
univalent case $p^* = 1/n$ and the bivalent case with $k = n - 1$:

$$\frac{1}{t+1} < p^*_1 = p^*_2 = \cdots = p^*_{n-1} < \frac{2}{t+1} < p^*_n. \tag{9}$$

(ii) We saw above that $E[U_t]$ is always maximized by a distribution p of
the form

$$x = p_1 = p_2 = \cdots = p_{n-1} \le p_n = 1 - (n-1)x.$$

For distributions of this form, we have

$$E[U_t] = G_t(x) = (n-1)x(1-x)^t + (1 - (n-1)x)((n-1)x)^t, \tag{10}$$

where $G_t$ is defined on [0, 1/n]. Note that x = 1/n corresponds to
the univalent (uniform) distribution, while x < 1/n corresponds to a
bivalent distribution.

We claim the existence of a function $\tau : \mathbb{N} \to \mathbb{N}$ such that

(a) for $t < \tau(n)$, $G_t$ has the unique maximizer $x^* = 1/n$;

(b) for $t > \tau(n)$, $G_t$ has the unique maximizer $x^* < 1/n$.

(In principle, it may be possible for $G_t$ to have two distinct maxima
on [0, 1/n] for $t = \tau(n)$, but this is rather implausible.)

For $t \le n$, (9) implies that $G_t$ has the unique maximizer $x^* = 1/n$;
this shows that $\tau(n) > n$ (if the function $\tau$ described in (a) and (b)
exists at all).
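The transition from a univalent to a bivalent maximizer can be observed numerically. The sketch below is an illustration under stated assumptions, not the argument used in the proof: it evaluates $G_t(x) = (n-1)x(1-x)^t + (1-(n-1)x)((n-1)x)^t$ on a uniform grid over [0, 1/n] and reports the first t at which the grid maximizer falls strictly below 1/n. The function names, the grid size and the choice n = 10 are all hypothetical.

```python
def G(n, t, x):
    """Expected missing mass (10): n - 1 light atoms of mass x, one heavy atom."""
    light = (n - 1) * x            # total mass carried by the light atoms
    return light * (1.0 - x) ** t + (1.0 - light) * light ** t


def argmax_G(n, t, grid=5000):
    """Grid maximizer of G_t on [0, 1/n]; the last grid point is exactly 1/n."""
    xs = [k / (grid * n) for k in range(grid + 1)]
    return max(xs, key=lambda x: G(n, t, x))


def first_bivalent_t(n, t_max=None, grid=5000):
    """Smallest t <= t_max whose grid maximizer lies strictly below 1/n."""
    t_max = 10 * n if t_max is None else t_max
    for t in range(1, t_max + 1):
        if argmax_G(n, t, grid) < 1.0 / n:
            return t
    return None


n = 10
t_switch = first_bivalent_t(n)
```

A grid search only approximates the maximizer, so `t_switch` should be read as a numerical indication of where the threshold lies rather than as its exact value.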
Now define the function $R_t(x) = G_t(x)/G_t(1/n)$. Then

$$R_t(x) = (n-1)x\left(\frac{1-x}{1-1/n}\right)^t + (1 - (n-1)x)(nx)^t. \tag{11}$$

For $x < 1/n$, the first term on the right-hand side of (11) grows
exponentially with t, and certainly $R_t(x) > 1$ is achieved for some finite t.
But this means that $G_t$ has a unique maximum at some $x < 1/n$ and
so any function $\tau(n)$ satisfying (a) and (b) must be finite for all n.

Suppose that t is such that $G_t$ achieves a maximum at $x < 1/n$.
It follows from the uniqueness proof below and from (15) that the
maximizer $x^*$ of $G_t$ is contained in the interval $I_t = (1/(t+1), 1/t)$.
We claim that $R_t(x) > 1$ implies $R_{t+1}(x) > 1$ for all $x \in I_t$; from here,
the existence of $\tau$ satisfying (a) and (b) follows immediately. Treating
t as a continuous variable, we have

$$\frac{dR_t(x)}{dt} = (n-1)x\left(\frac{1-x}{1-1/n}\right)^t \log\left(\frac{1-x}{1-1/n}\right) + (1 - (n-1)x)(nx)^t \log(nx).$$

We establish the monotonicity claim by showing that

$$\frac{dR_t(x)}{dt} > 0, \qquad t \ge n+1,\ x \in I_t. \tag{12}$$

Indeed, appealing to the inequalities $\xi^t \log \xi \ge -1/(et)$ for $0 < \xi < 1$ and
$\xi^t \log \xi \ge \xi - 1$ for $\xi \ge 1$ (checked by elementary calculus), we see that
the inequality

$$(n-1)x\left(\frac{1-x}{1-1/n} - 1\right) > \frac{1 - (n-1)x}{et}$$

is even stronger than (12). The latter will hold as long as

$$-entx^2 + (et + n - 1)x - 1 > 0,$$

and, since the left-hand side is concave in x, it suffices to verify the
inequality at the endpoints $1/(t+1)$ and $1/t$ of $I_t$, which is
straightforward. This proves the existence of $\tau$ as claimed in (a) and (b).

Uniqueness is established by noting (again, via elementary though
rather tedious calculus) that $G_t'$ vanishes at $x = 1/n$ and at not more
than two points in the interval $[1/(t+1), 1/n)$, and is strictly positive
on $[0, 1/(t+1))$.
(iii) For $n \in \mathbb{N}$ and $t \ge \tau(n)$, let $x^*$ be the maximizer of the function G
defined in (10), where we have dropped the subscript t. A sufficient
condition for $G(x^*) > (1 - 1/n)^t$ to hold is $G(1/t) > (1 - 1/n)^t$. The
latter, in turn, will hold as long as $(n-1)(1 - 1/t)^t/t > (1 - 1/n)^t$. We
will show that the latter inequality holds for large n, if $t = n + \sqrt{2n}$.
To this end, define the function

$$g(n) = \frac{n-1}{n+\sqrt{2n}}\left(1 - \frac{1}{n+\sqrt{2n}}\right)^{n+\sqrt{2n}} - \left(1 - \frac{1}{n}\right)^{n+\sqrt{2n}}.$$

For $\nu \in (0, 1)$, define $\tilde{g}(\nu) = g(1/\nu)$ and expand it about $\nu = 0$:

$$\tilde{g}(\nu) = \frac{\sqrt{2}}{3e}\,\nu^{3/2} + O(\nu^2).$$

Since for this choice of t we have $G(1/t) - (1 - 1/n)^t = \Omega(n^{-3/2})$, it
follows that

$$\tau(n) \le n + \sqrt{2n}, \qquad n \gg 1. \tag{13}$$

To get a lower bound on $\tau$, we estimate $G(x^*)$ from above:

Now let 



(1 - l/nf 

n-l ^l-l/(t + l)V ^ / n-l\ / (n-l)/t V^^ 



t + iV 1-1/n y V * /Vi-i/^ 

and note that $Q_t(n) < 1$ implies $t < \tau(n)$. Let us put $t = n + (1 - \varepsilon)\sqrt{2n}$
for some $0 < \varepsilon < 1$, and observe that for this choice of t, the second
term on the right-hand side of (14) is negligible:

$$\left(\frac{(n-1)/t}{1-1/n}\right)^t = \left(\frac{n}{t}\right)^t = \left(1 - \frac{t-n}{t}\right)^t \le \exp(-(t-n)) = \exp(-(1-\varepsilon)\sqrt{2n}).$$

Let us now examine the asymptotic behavior of the first term in (14).
To this end, define the functions

$$h(n) = \frac{n-1}{t+1}\left(\frac{1 - 1/(t+1)}{1 - 1/n}\right)^t$$
and $\tilde{h}(\nu) = h(1/\nu)$, and expand about $\nu = 0$:

$$\tilde{h}(\nu) = 1 - (2 - \varepsilon)\varepsilon\nu + O(\nu^{3/2}).$$

Hence, for this choice of t, we have

$$\frac{G(x^*)}{(1 - 1/n)^t} \le 1 - (2 - \varepsilon)\varepsilon n^{-1} + O(n^{-3/2}) + \exp(-(1 - \varepsilon)\sqrt{2n}) < 1$$

for sufficiently large n. This, combined with (13), implies

$$\tau(n) = n + (1 + o(1))\sqrt{2n}.$$

(iv) We claim that in the bivalent case, the maximizer $x^*$ of G satisfies

$$\frac{1}{t+1} < x^* < \frac{1}{t}. \tag{15}$$

The first inequality follows from (9). To verify the second inequality
it suffices to show that $G'(1/t) < 0$. Indeed, differentiating (10) and
evaluating at $x = 1/t$ gives

$$\frac{G'(1/t)}{n-1} = -\frac{1}{t}\left(1 - \frac{1}{t}\right)^{t-1} + \left(t - \frac{(n-1)(t+1)}{t}\right)\left(\frac{n-1}{t}\right)^{t-1} < 0,$$

where the final inequality uses the assumption $t > n + \sqrt{2n}$.



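To establish our claim, we seek a small $\delta > 0$ such that $G'(1/(t+1) + \delta) < 0$;
any such $\delta$ will yield the bound

$$\frac{1}{t+1} < x^* < \frac{1}{t+1} + \delta.$$

Putting $p = 1/(t+1) + \delta$, we have

$$\begin{aligned}
\frac{G'(p)}{n-1} &= -\delta(t+1)(1-p)^{t-1} + (t - n + 1 - (n-1)\delta(t+1))((n-1)p)^{t-1} \\
&< -\delta(t+1)(1-p)^{t-1} + (t - n + 1)((n-1)p)^{t-1} \\
&< -\delta t(1-p)^{t-1} + t((n-1)p)^{t-1}.
\end{aligned}$$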
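Thus, to ascertain that $G'(p) < 0$ it suffices to show that

$$\left(\frac{(n-1)p}{1-p}\right)^{t-1} < \delta.$$

It follows from (15) that we may take $p < 1/t$, and hence

$$\left(\frac{(n-1)p}{1-p}\right)^{t-1} \le \left(\frac{(n-1)/t}{1 - 1/t}\right)^{t-1} = \left(\frac{n-1}{t-1}\right)^{t-1} \le \left(\frac{n}{t}\right)^{t-1}.$$

Our assumption that $t > n + \sqrt{2n}$ implies

$$\left(\frac{n}{t}\right)^{t-1} < \exp(-\sqrt{n/2}).$$

Thus, we may take $\delta = e^{-\sqrt{n/2}}$.

□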



4 Application: missing mass in metric spaces 

If P is a nondegenerate continuous distribution, then the missing mass as
defined in (1) is trivially 1 for all $t \in \mathbb{N}$. To define a nontrivial extension of
this notion to continuous spaces let us start with a metric probability space
$(X, P, d)$, whose σ-field is induced by the metric topology. For $x \in X$, let
$B_\varepsilon(x)$ be the ε-ball about x: $B_\varepsilon(x) = \{y \in X : d(x, y) < \varepsilon\}$. For $S \subset X$,
define its ε-envelope, $S_\varepsilon$, to be

$$S_\varepsilon = \bigcup_{x \in S} B_\varepsilon(x).$$

For ε > 0, define the ε-covering number, $N(\varepsilon)$, of X as the minimal
cardinality of a set $E \subset X$ such that $X = E_\varepsilon$. A space is totally bounded if $N(\varepsilon) < \infty$
for all ε > 0. Define the ε-missing mass of the sample $S = \{X_1, \ldots, X_t\}$ as
the random variable

$$U_t(\varepsilon) = P(X \setminus S_\varepsilon). \tag{16}$$

The expected ε-missing mass of totally bounded spaces is controlled via the
covering numbers:

¹ This problem is of interest in anomaly detection applications (Kontorovich et al., 2011).
Theorem 8. In a totally bounded metric probability space $(X, P, d)$,

$$E[U_t(\varepsilon)] \le \frac{N(\varepsilon)}{et}.$$

Proof. For a fixed ε > 0, let $\{e_1, e_2, \ldots, e_n\}$ be an ε-net for X with $n = N(\varepsilon)$. For
$i = 1, \ldots, n$, put $p_i = P(B_\varepsilon(e_i))$; note that $\sum_i p_i \ge 1$. Then, invoking Lemma 7,
we have

$$E[U_t(\varepsilon)] \le \sum_{i=1}^{n} p_i(1 - p_i)^t \le \frac{n}{et}.$$

□
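To make definition (16) concrete, the sketch below computes the ε-missing mass of a sample in one simple setting chosen here as an assumption: X = [0, 1] with the usual distance and P the uniform distribution. In that case $U_t(\varepsilon)$ is the length of the part of [0, 1] lying farther than ε from every sample point. The function name is hypothetical.

```python
def eps_missing_mass_uniform(sample, eps):
    """eps-missing mass of a sample under the uniform distribution on [0, 1].

    The covered set is the union of the intervals (x - eps, x + eps) clipped
    to [0, 1]; its length is accumulated left to right after sorting.
    """
    covered = 0.0
    right_end = 0.0   # [0, right_end] has already been accounted for
    for x in sorted(sample):
        lo = max(x - eps, right_end)
        hi = min(x + eps, 1.0)
        if hi > lo:
            covered += hi - lo
            right_end = hi
    return 1.0 - covered


single = eps_missing_mass_uniform([0.5], 0.1)
overlapping = eps_missing_mass_uniform([0.1, 0.15], 0.1)
```

Whether the balls are taken open or closed makes no difference here, since the boundary points carry no mass under the uniform distribution.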